Future Reliability Engineering

Abstract

Mojan and Fred discuss the monumental reliability challenges of a mission to Mars. While the fundamental physics remains the same, they explore why success depends on a more rigorous application of existing methods, autonomous systems, and the shift toward heavy simulation and in-situ repairs.

Key Points

Join Mojan and Fred as they discuss what it takes to build systems that must perform flawlessly for years in the extreme, unserviceable Martian environment.

Topics include:

Implementing the Fundamentals: Mars doesn’t necessarily require “new” reliability physics, but rather a perfect, thorough execution of the techniques we already know.
The Five-Year Mandate: Designing for Mars means building expensive, high-stakes hardware that must survive five years of extreme conditions with zero chance of a traditional “repair mission.”
The NASA Legacy: Reflecting on the 1960s approach, the hosts highlight how massive redundancy and deep analysis allowed early space systems to “just work” in high-risk environments.
Virtual Validation: Future space reliability will rely almost entirely on high-fidelity simulations and virtual analysis, as physical testing for every Martian variable is often impossible.
Autonomous Survival: Systems must be capable of diagnosing and repairing themselves autonomously, as the communication lag to Earth prevents real-time troubleshooting.
In-Situ Criticality: Local repairs using on-site resources will be the “insurance policy” for mission success, supported by a vast network of sensors to monitor system health.
Evolving Assumptions: Our reliability models for Mars will start with best guesses, but must be designed to adapt as we gain more data from actual Martian conditions.

Enjoy an episode of Speaking of Reliability, where you can join friends as they discuss reliability topics. Join us as we discuss topics ranging from design for reliability techniques to field data analysis approaches.

Speaking Of Reliability: Friends Discussing Reliability Engineering Topics | Warranty | Plant Maintenance

SOR 1167 Future Reliability Engineering

Transcript

Fred
Speaking of reliability, a podcast with good friends talking with you about reliability engineering topics. Welcome to Speaking of Reliability. This is Fred Schenkelberg.

Mojan
And this is Mojan Sohi. Hey, Fred.

Fred
Hey, Mojan. We got an interesting prompt.

And then when I mentioned that earlier to you, I said, oh, that sounds like I know what I’m doing with AI. But Joseph had reported a couple of broken links. And so we sorted that out.

And then he’s prepping for CRE and enjoying the website and stuff like that. And, just randomly in his message, he said, “I wonder what leaps and bounds reliability engineering will have to make when we go to Mars. Maybe nothing new or amazing, but a very thorough implementation of failure modes analysis of what we know.

But then there are the unknown unknowns...” End quote. Well, it’s an interesting notion. And it’s another one of these T-shirt moments for you to get your pencil ready.

One of the reasons I really like reliability engineering is it doesn’t change all that much. We have FMEAs; that was in the 50s. We have block diagrams; that was in the 50s.

We have fault tree analysis; that was in the 50s. We have physics of failure, which started in the 50s, early 60s. We have Weibull analysis.

Weibull wrote that paper in the 50s. And on and on and on. And control charts are from the 20s, like they’re over a century old.

And the vast majority of the primary tools we use are nothing earth-shattering or new or novel or whatever, despite all the academics at conferences going, well, here it is. If you look at this fault tree, when you hold the page up at an angle and get the light just right, then it does this. Right?

Yeah. You’ve been there.

Mojan
You just got to squint.

Fred
Yeah, squint and wave your hand and use a fancy term and a weird symbol on it. And it’s new. Totally new.

No, it’s not. It’s such a corner case nobody ever uses it. That’s one of the benefits: our techniques, the methods we use, haven’t really evolved at all.

And so what will it take to create something that’s super expensive, complex, and has to work for five years without fail, if we’re going to put people inside a capsule to send them away for six months?

Mojan
Not to mention that it’s something expensive, right? At least at the beginning, it’s going to be very expensive. So, to your point earlier, I don’t think we’re going to have the luxury of, you know, I mean, I think a lot of people, when they think about reliability, they think about testing.

This is not going to be something that we have the luxury of being able to test a lot, just because of the cost.

Fred
Well, my reference, I’ve been thinking back to NASA in the 60s, and they started off with a single-person capsule that, you know, basically jammed them in there like a sardine and just go up high enough, come back down, let’s hope the parachute works. You know, and then Gemini was two people, and they did a spacewalk, I think, you know, stuck their head out of the capsule and stuff like that. And it took them 10 years, but in 10 years, they went from, somebody put a satellite up there, to, we got to get to the moon first, for whatever reason.

And they did. But it was between Mercury and Gemini and the early Apollo, it was a lot of testing. And they basically just worked.

There were a few problems and explosions, and they had, you know, safety things that did, they had an ability, if the rocket was bad on the launch pad, that they could rocket off the capsule and save people, or they had a zip line they could jump into and get out if they could get out of the capsule, stuff like that. But it was, yeah, it wasn’t like they had 30 rockets that they launched one a month to see how far we would go, and then learn from it, like what SpaceX is doing now. SpaceX has launched more rockets, I think, in this year, in the first couple of months of this year than NASA ever did.

Mojan
Really? Wow. That’s amazing.

Fred
I’d have to double check it, but I saw a map where they’re launching like every other day, every three days.

Mojan
Wow.

Fred
Most of them are, you know, putting up satellites and doing other space junk up there. But that might be a whole nother realm of reliability is how much space junk is up there.

Mojan
Wow, you’re right. I just looked it up. As of March 10th, 2026, SpaceX has launched 30 Falcon 9 rockets this year.

Yeah.

Fred
Yeah. They’re cranking them out. And I think I saw one that because they land it and refurbish it, put it back on the launch pad.

They have one of them that’s done it like 15 times or something like that.

Mojan
It’s so fascinating, like, you know, to think about rockets as being repairable systems. It’s just a big chunk of steel, you know. Yeah, yeah, yeah, yeah.

I mean, or maybe, maybe it’s, I don’t know, some of them. I’m sure it obviously has some composites in there. But yeah, it’s, it’s just so interesting to think about them as repairable systems.

Fred
Yeah. And it’s, I mean, the boundaries and the barriers and so on, back to Joseph’s question, aren’t anything really new to what most of our tools can implement and do. And the other one is, I think it was the Curiosity [Opportunity] rover, was one of the other ones.

They expected it to last, you know, only a certain amount of time. It was like a month or two, and it’s gone 15 years, done all kinds of stuff, just kept rolling around. And then the last message the rover sent back said, my batteries are critically low, and it’s getting dark.

It’s like, oh, you poor little thing. Sandstorm came.

Mojan
It’s a sad message. Yeah.

Fred
But it was the, the, and I knew some folks at JPL, and they were very pleasantly surprised. One, it worked. And two, it worked for a year.

They were just, how is this happening? Now, they got to figure out funding to keep the staff going, to keep exploring and keep doing stuff. They did not plan on it lasting at all, more than a few months.

And the amount of stuff they learned and all the stuff they did, it was amazing. But there was nothing particularly unique about how they approached the reliability of it. They designed for the environment they expected.

They designed some contingencies, some backup systems, you know, and especially in the communications type stuff. And, but enough resilience that if it ran into a boulder that they didn’t plan on, it would stop autonomously and not just try to, you know, roll over it or try to get, you know, it would wait for further instructions. But if we’re putting people on a, in a capsule to Mars, we don’t get that backup.

The time delay is not acceptable if there’s a leak, if there’s a problem.

Mojan
Right. Which means, which means, I guess, like that there’s, there’s bound to be lots of repairs in situ, right? Like you need to be able to do on-site repairs.

Hopefully, you know, some that are essentially automatic. I would imagine, you know, maybe this is a bit sci-fi, but there’s bound to be lots of sensors on the system to be able to detect damage. Damage monitoring has got to be a thing, you know, for a system like this. And for it to be able to automatically detect what’s happening, if there’s something that is looking like an outlier and, you know, it’s being damaged faster than expected.

And then to be able to automatically repair itself too, you know, in many cases, I would think that has to be a part of it.
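
As a rough sketch of the damage monitoring Mojan describes, the Python below fits a trend to a series of wear readings and flags a component that is degrading faster than expected. The function names, the 1.5x margin, and the assumption that wear grows linearly are all hypothetical choices made for illustration, not details from any actual flight system.

```python
from dataclasses import dataclass


@dataclass
class DegradationCheck:
    observed_rate: float   # wear units per day (least-squares slope)
    is_outlier: bool       # True when wear outpaces the expected rate by the margin
    days_to_limit: float   # projected days until the failure limit is reached


def fit_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def check_degradation(times, values, expected_rate, failure_limit, margin=1.5):
    """Compare the observed wear rate with the expected one (assumes linear wear)."""
    rate = fit_slope(times, values)
    remaining = failure_limit - values[-1]
    # A flat or improving trend never reaches the limit under this simple model.
    days = remaining / rate if rate > 0 else float("inf")
    return DegradationCheck(rate, rate > margin * expected_rate, days)


if __name__ == "__main__":
    # Hypothetical seal-wear readings taken every 10 days.
    result = check_degradation(
        times=[0, 10, 20, 30],
        values=[0.0, 0.5, 1.0, 1.5],
        expected_rate=0.02,
        failure_limit=5.0,
    )
    print(result)
```

A real system would need a degradation model matched to the failure mechanism and some treatment of sensor noise. The point here is only that "damaged faster than expected" can be turned into an explicit, checkable rule.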

Fred
Yeah. You see that in science fiction, it’s like something hits a window on a space station and then it just kind of seals itself. But yeah, that always happens.

Yeah. Okay, cool. Also the trade-off is all of those extra sensors, all that extra computing power, all of the spare parts are all weight.

And it’s, how are you going to look, it might be three or four rockets up to a place where you assemble the kit and then you take off and go to Mars because it’s a lot easier to move in space than it is off the surface.

Mojan
Yeah.

Fred
I think that was why people were looking at going to the moon because it would be easier, easier as a launch pad, especially if they could get water and fuel there.

Mojan
Yes, exactly. You need, you need a, like a, to be able to do resupply missions without having to go like completely in and out of orbit. I would imagine every time you need to like, you know, send something there.

So somewhere like the moon seems like the easier, easier option to be able to do that.

Fred
But then you got more of these unknown unknowns. It’s like, all right, we’re making fuel from moon dust. Okay.

What possibly could go wrong? It’s just the, the ability, I mean, engineers are great at thinking of all the, you just get a couple of creative engineers together and do some brainstorming and you’ll never build anything because it’d be too scary. And so it’s, yeah, we can anticipate all kinds of things yet.

It doesn’t cost a whole fortune to send, you know, an equivalent orbit, an equivalent size piece of equipment unmanned, but sensored up, to find out what the hazards are. What are the environments? How does this radiation that is away from our envelope of earth really affect everything that’s going on, all those kinds of things, and then send up a brave soul or two that will give it a try.

Mojan
Right. And there has to be, you know, as we just normally do in DFMEAs, there’s gotta be, you know, obviously differentiation between like, okay, these failure modes just absolutely cannot happen basically. Or if they happen, it’s essentially like, you know, catastrophic and it could result in loss of life, which would obviously be horrible.

And then other failure modes are like, well, okay, like how much time, you know, that, that could be like a degradation thing that, you know, gives me some time to be able to react to it. And it’s not necessarily immediately catastrophic. And so you have to, you know, obviously you have to think about what to do for those super high severity ones and how to prevent those, which to your point earlier, I think would mean probably, you know, that’s where you’re going to, if you’re going to have to, you know, use up your weight distribution, you’re probably going to allocate more sensors and analysis to those things than the other failure modes.
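
One way to picture the trade-off Mojan describes is a simple ranking: failure modes with the highest severity and the least time to react get first claim on a limited sensor mass budget. The sketch below is a hypothetical illustration. The failure modes, the 1-to-10 severity scale, the masses, and the greedy allocation rule are all assumptions, not a real mission allocation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int          # 1 (minor) to 10 (possible loss of life)
    hours_to_react: float  # time available before the effect becomes critical
    sensor_mass_kg: float  # mass of the monitoring hardware for this mode


def allocate_monitoring(modes, mass_budget_kg):
    """Greedy allocation: highest severity first, shortest reaction time as tie-breaker."""
    ranked = sorted(modes, key=lambda m: (-m.severity, m.hours_to_react))
    chosen, remaining = [], mass_budget_kg
    for mode in ranked:
        if mode.sensor_mass_kg <= remaining:
            chosen.append(mode.name)
            remaining -= mode.sensor_mass_kg
    return chosen, remaining


if __name__ == "__main__":
    # Hypothetical failure modes for illustration only.
    modes = [
        FailureMode("cabin leak", 10, 0.5, 4.0),
        FailureMode("coolant pump wear", 6, 200.0, 3.0),
        FailureMode("power bus short", 9, 1.0, 5.0),
        FailureMode("galley heater drift", 2, 1000.0, 2.0),
    ]
    print(allocate_monitoring(modes, mass_budget_kg=10.0))
```

Slow degradation modes that leave time to consult Earth fall to the bottom of the list, which matches the distinction drawn in the conversation between immediately catastrophic failures and ones that give you time to react.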

Fred
Yeah. Because if you have time to get back to earth, let those computers and engineers help you. Whereas if it’s, no, I got to fix this now.

You have to have the right equipment. Now that’s part of the maintenance part of it, right? But we know about maintenance.

We know about repair systems and sensing, looking for failures or signals and stuff like that. Do you think, is there anything new or different? I’m thinking of the genesis of a lot of the big tools that we use now, like FMEA and block diagrams and that kind of stuff was in the Navy submarine building programs in the fifties and sixties.

And they realized, oh, we actually need to pay attention to this reliability thing. And they basically invented a handful of these tools and made them more rigorous and useful. And it was also in project management tools too.

A lot of that came out of those kinds of programs because it was a very large, very complex system on a tight timeline and they had to get it right because there’s a hundred people in the submarine and it’s like if something goes wrong, it’s really bad. But I think that goes back to Joseph’s question. If we’re running into one of these challenges, another way to think of it is, are the tools we use now adequate to the challenge?

Mojan
Yeah. To your point, it’s like, does the math work? Are there conditions where we need a new math to be able to figure that out?

I don’t know.

Fred
Well, I’m thinking of the, I mean, right now we have really poor tools, mostly because most of the time we make a lot of assumptions, right? And we simplify things down. Alex did a webinar on Tuesday this week.

And when this comes out, it’ll be last month. So if I have to time jump there real quick, well, that would be a nice tool. If we could have a time machine to go 20 years in the future and find out all of the ways that our product fails and then we come back and redesign it, that would be way cool.

Yet it’s, if what we have now is too simplified and we’ve got great computing power now, you know, allegedly, not according to Star Trek standards, but we have computers now and vibe coding. Don’t forget vibe coding. So what kinds of tools can get rid of the assuming everything’s independent that we can use?

So large language models and neural networks and stuff like that can handle that. A lot of people don’t trust it because we don’t know how it did what it did and got the answers it got. Yet I think adopting more of these tools helps us avoid making the simplifying assumptions.

Because, I mean, finite element analysis for mechanical engineers has helped a whole lot. You know, I’ve run into engineers that say, well, I assume this is where the problem is. Well, what’s the analysis say?

Oh, it’s not there. It’s over here. Okay.

And then we go test it and it’s, yeah, it’s where the model said. So it’s things like that. What tools will help us get past the limitations that our current methods have?
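
To make the cost of the "everything is independent" assumption concrete, the sketch below compares a redundant pair of units under two models: fully independent failures, and the beta-factor common-cause model, where a fraction beta of each unit’s failure rate is shared and takes out both units at once. The constant failure rate, the value of beta, and the five-year mission time are assumed numbers chosen only for illustration.

```python
import math


def parallel_pair_independent(failure_rate, hours):
    """Reliability of a 1-out-of-2 pair with independent exponential failures."""
    unit_unreliability = 1.0 - math.exp(-failure_rate * hours)
    return 1.0 - unit_unreliability ** 2


def parallel_pair_common_cause(failure_rate, hours, beta):
    """Beta-factor model: a fraction beta of the failure rate is common to both units."""
    independent_rate = (1.0 - beta) * failure_rate
    common_rate = beta * failure_rate
    # The pair survives only if no common-cause event occurs AND
    # at least one unit survives its independent failures.
    pair = parallel_pair_independent(independent_rate, hours)
    return math.exp(-common_rate * hours) * pair


if __name__ == "__main__":
    rate = 1e-5              # assumed failures per hour, per unit
    mission = 5 * 365 * 24   # five years, in hours
    print("independent: ", parallel_pair_independent(rate, mission))
    print("common cause:", parallel_pair_common_cause(rate, mission, beta=0.1))
```

With beta set to zero the two models agree. With beta set to one the redundant unit adds nothing, because every failure hits both units together. Any value in between shows how much of the apparent benefit of redundancy comes from the independence assumption rather than from the hardware.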

Mojan
Yes. To your point, a lot of the limitations for our current tools are assumptions that we make. And if we can know more about those assumptions, it will reduce the number of assumptions.

It will go a long way in being able to improve our understanding and, you know, then have a good understanding of what the risk is, basically, so we can quantify it.

Fred
I think the one thing that, I mean, I didn’t expect that we would sit down and go, oh, here’s the 15 things that will be breakthrough reliability improvements. But I am a little disappointed in you, Mojan. You didn’t come.

Mojan
No.

Fred
I mean, the assumption part is a piece of that. And there are bits and pieces of different tools that engineering folks are using that help us do that. The one I don’t see very often, and you see it really, really rarely, and many of our tools aren’t even built to handle it, is like the finite element analysis.

It works great with a perfect steel bar, right? Or here’s our structure, and we make simplifying assumptions about the fiberglass that’s used in this circuit board, for example. If we could include the traces and the components and all those pieces on it and age it 10 years, the solder joints have cycled 10,000 times, and it’s gone through thermal cycling, and it’s got radiation exposure to it.

And a lot of those individual stresses, we understand or should, of how it affects things, and then run the stress analysis. And it’s artificially, you know, simulate the aging process and then run your stress systems. Can it still work there?

The tools we have right now don’t make that easy. And so I’m thinking it’s going to be incorporating the physics of failure type knowledge into the simulation tools, whether it’s for mechanical or electrical signaling or, you know, ability to withstand a strain or a stress that is on the, you know, like a large event that occurs, how resilient are you to it? And it’s, so that’s one area I can think of that may be our breakthrough, and it’s based on computing power, obviously.
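
The "age it, then run the stress analysis" idea can be sketched with a classic stress-strength interference calculation: treat both the applied stress and the part’s strength as normally distributed, reduce the mean strength to represent aging, and recompute the probability that strength still exceeds stress. The linear strength-loss model and all of the numbers below are assumptions for illustration. A real analysis would take the aged properties from physics-of-failure models or test data.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_reliability(strength_mean, strength_sd, stress_mean, stress_sd):
    """P(strength > stress) for independent, normally distributed strength and stress."""
    z = (strength_mean - stress_mean) / math.hypot(strength_sd, stress_sd)
    return normal_cdf(z)


def aged_strength_mean(initial_mean, loss_per_year, years):
    """Assumed linear loss of mean strength with age, floored at zero."""
    return initial_mean * max(0.0, 1.0 - loss_per_year * years)


if __name__ == "__main__":
    # Hypothetical solder-joint values in MPa; not measured data.
    stress_mean, stress_sd = 30.0, 5.0
    strength_sd = 4.0
    for years in (0, 5, 10):
        strength = aged_strength_mean(50.0, loss_per_year=0.03, years=years)
        r = interference_reliability(strength, strength_sd, stress_mean, stress_sd)
        print(f"year {years}: mean strength {strength:.1f} MPa, reliability {r:.4f}")
```

The same load that a new part shrugs off becomes a meaningful risk once the strength distribution has drifted toward the stress distribution, which is the effect an aging-aware simulation would need to capture.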

Mojan
I agree, but I would say the first, I mean, I think like a lot of that does happen today already as well. Like we know, you know, I mean, at least in my experience, we’ve used degraded material properties in simulations and the virtual tools to understand how does it perform, you know, after the material has degraded and it’s been aged. I think that, yeah, but I think that prior to that, though, I think that we do need new materials.

Like that’s going to be the first thing I think is going to be like a lot of the existing materials. I think, you know, like you said, like we need optimization around weight, like weight is a big restriction. And I think one of the big changes is going to be now, whether it’s strictly reliability engineering or not, I don’t know, but, you know, material engineering may want to think otherwise.

But I think the first thing is going to be the, what, you know, AI and large language models and just more compute power enables is going to be inventing new materials. That is, I think, going to be one of the big things that will be improving and pushing, you know, accelerating this advancement, because I do think we’re going to need new materials for lots of things for weight optimization, for thermals, you know, for, um, just, just, yeah, radiation. Yeah, exactly.

For, for being able to, for repairability for, you know, you can’t exactly weld necessarily very easily. Um, and for, for, you know, talking about the, you know, like there’s some work on self healing materials, but it has a long way to go. And that I think is going to be like, you know, big advancements there in that area are going to be needed to be able to, um, yeah, sustain, you know, basically like a city or some number of humans to be able to live on Mars.

Um, so.

Fred
Yeah, I, yeah, I agree. But I cringe because we just spent a whole pile of money putting a new roof on this house that we’re in because the previous roof, about 15 years ago, was made of a new material, a composite. And it was a pressed-fiber type thing that simulated the look of slate at a 10th the weight.

So it was weight saving. So you could put slate looking roofs on, on structures that weren’t designed to hold the weight. And for whatever reason, this company didn’t anticipate that these things would be in the sun all day in thermal cycling.

And so after about two years, they start to crack and fall off. And so they became known as the potato chip roof. Oh no.

And they of course got sued and went out of business and did all that other stuff. And I mean, a bird would, you know, land on our roof and an eight-inch-square chunk of the tile would come ripping down the, it’s an A-frame. So it’s a really steep roof.

It’d come ripping down and smash into the yard. So at some point it was like, we got to fix this. Yeah.

In that case, it was pretty darn clear that their testing was, they put it up on the roof for a couple of days and say, Hey, see, it works. And it, they didn’t let that, I don’t know whether it was the binder they used, or it was the UV degradation or what exactly the mechanism was. You can take a look at it when you bring your FA lab with you.

Yes.

Mojan
I’ll bring, I’ll bring my microscopes and my, my analyzers with me. Yes.

Fred
Yeah. And then, but they probably designed it as if it was the standard for the standard asphalt shingles that have been standard roofing for 50 years. And they probably use that standard international standard or building standard or whatever to say, Oh, we need that.

And thought they were good. And so I, part of the change in reliability is that maturity piece and recognizing that, yeah, we got to use the latest, greatest tools and techniques and, and aging our materials. And if we do a new material, it’s, it’s not proven just because it met our old standard.

It has to prove itself in this different environment over time. And that’s going to be a challenge. How do we go about doing that?

When we, we had somebody that says, no, we want to be there next week. Come on, let’s go. Yes.

Mojan
As the very typical product timelines, uh, you know, we’ll get in the, we’ll, we’ll push the, we’ll push the schedule to possibly skip some tests or maybe skip some thinking. Yeah, exactly.

Fred
Yeah.

Mojan
Yeah. Is that, is that really a real failure mode? Is that, is that really going to happen?

I don’t think it’s a real risk.

Fred
Yeah.

Mojan
But you know, on, on that note, I think talking about the bringing the lab over, I do think that the, that being able to analyze the materials and, you know, having the techniques that, uh, are required to, to be able to tell, you know, to, to be able to analyze if something has failed, that is also going to need to be something that is, that should be, you know, will need to be in situ basically, because you’re going to need to be able to do the failure analysis on site.

Um, which means that the analyzers need to be advanced in such a way so that they are, you know, more capable and smaller in size and weigh less, uh, to be able to transport them. And they will need to, you know, operate probably more simply. I think like right now, a lot of analyzers still require, you know, that you have this giant tank of something, you know, whether it’s argon or some other, right.

Fred
Exotic material, some sort of temperature and everything else.

Mojan
Yeah, exactly. Exactly. Um, which, you know, that, that stuff I think is going to need to be simplified, which again, I think that with the advancement of materials, um, given the use of, um, you know, more compute power, I think is going to, is going to be enabled.

Um, man, I’m so excited about this now, the more we talk about it, the more I’m like, oh, these amazing things are going to happen. It’s very, very cool.

Fred
We’ll get an upgrade to Velcro finally. I mean, it is amazing in the sixties, how many different materials and types of equipment and even computer programming, all of that stuff just went gangbusters. And since it was a government-funded program, it all was public domain.

And, you know, imagine if somebody, you know, had in a private company invented Velcro and realized that it was a money maker, it would cost us a fortune to get Velcro shoe tabs for our kids. It just costs a fortune, but it, it, it had all kinds of spinoff benefits other than just going to the moon. So I think that was kind of cool.

Um, if only we could get our cars to be self-healing and, after they, you know, get in an accident, get put back on the road a day later. There’s something to learn from how they turn these rockets around.

Mojan
Yeah. Uh, and I think, you know, I wonder if it would be like a combination of, uh, just learnings from different industries too, you know, like Formula 1, for example, like how fast they’re able to figure out what needs to be changed and, you know, at the pit, change it. And I think there’s a lot of learnings to be had from both low volume and high volume industries and simple.

Yeah, exactly. Exactly. And just, just, you know, I think the advancement again in, um, you know, there’s, there’s companies that are, that are enabling faster and better simulations now with that, you know, there’s more compute power.

Um, and the same thing will be enabled for CAD, right? Like if you’re able to design in CAD faster, and, you know, come up with more complex designs and, uh, combine that with materials, so much will change. That will be a big delta for how we are able to prototype. And, you know, to your point, maybe that does enable some testing.

Um, and, and then of course the simulation side is advancing too. So, okay. Okay.

I’m convinced this is going to be, there’s, there’s going to be so many changes.

Fred
There are, and it’s, you know, in the reliability world is, you know, what are you seeing? What, what can we influence or advance or make happen? Um, it’s definitely not going to be a silo of just reliability engineers.

It’s going to be with computer scientists and, and, and material scientists and all the other engineering disciplines, plus human factors. How do we get a person to learn? Not only the 3 million switches on the control panel, but now you’ve got to do material analysis type stuff and fabricate new materials on the fly.

Um, yeah, it’s, it’s going to be a challenge. And I think the more we think about it and come up with ideas, the more that things, something will hit and it’ll, it’ll make a difference.

Mojan
Yeah, absolutely. And you know, this is making me think about, uh, Project Hail Mary. I don’t know.

Have you read, have you read the book Project Hail Mary?

Fred
No, no.

Mojan
Okay. Well, it is awesome. Uh, it’s about, uh, it’s, it’s a sci-fi, um, and the movie’s coming out soon and I’m very excited to watch it.

We’ll talk all about it, but I would highly recommend the book if you haven’t read it. It’s fascinating.

Fred
All right, cool. Well, we got a podcast and a book recommendation all in one day. All right, cool.

But if, you know, if you’re listening to this and you’re like, well, what about this? What about that? Let us know.

I think this would be a really cool topic for an ongoing discussion. Um, so, you know, send over your comments and ideas or what you’re seeing or what you’re running into, especially you rocket scientists out there that are actually working on this problem, whether you’re in Jet Propulsion Lab or NASA or in SpaceX or one of the other rocket companies or in a completely different industry. What kind of breakthroughs are you seeing that’s going to change the way we go about doing reliability engineering?

That would be fascinating, to get more discussion on it, more people involved with it. So let us know, head over to accendoreliability.com/go/sor. There’s a couple of ways to get in touch with us.

Mojan and I, and the other hosts, when we’re not off watching or reading science fiction books, um, are available on LinkedIn and our about pages. So plenty of ways for you to get in touch with us. So, uh, we look forward to your ideas and comments. And what Joseph sent us was just a concept or an idea, an I’m-wondering-about kind of thing.

Those are fine too. All good stuff.

Mojan
Yeah, absolutely.

Fred
All right. Well, thanks so much, Mojan, for entertaining this question. And well, I’m sure we’re going to talk about this and variations of it for a while now.

Mojan
Yeah, absolutely.
